Do unbalanced data have a negative effect on LDA?
For two-class discrimination, Xie and Qiu [The effect of imbalanced data sets on LDA: a theoretical and empirical analysis, Pattern Recognition 40 (2) (2007) 557–562] claimed that, when the covariance matrices of the two classes are unequal, a (class-)unbalanced data set has a negative effect on the performance of linear discriminant analysis (LDA). By re-balancing 10 real-world data sets, they provided empirical evidence to support this claim, using AUC (Area Under the receiver operating characteristic Curve) as the performance metric. We suggest that the claim is vague, if not misleading: no solid theoretical analysis is presented in their paper, and AUC can lead to a conclusion about the discrimination performance of LDA on unbalanced data sets quite different from that reached using the misclassification error rate (ER). Our empirical and simulation studies suggest that, for LDA, the increase in the median AUC (and thus the improvement in the performance of LDA) from re-balancing is relatively small, whereas the increase in the median ER (and thus the decline in the performance of LDA) from re-balancing is relatively large. Therefore, from our study, there is no reliable empirical evidence to support the claim that a (class-)unbalanced data set has a negative effect on the performance of LDA. In addition, re-balancing affects the performance of LDA for data sets with either equal or unequal covariance matrices, indicating that unequal covariance matrices are not a key reason for the difference in performance between the original and re-balanced data.
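To make the AUC-versus-ER distinction concrete, here is a minimal sketch (my own illustration, not the paper's experiments) of fitting LDA on a class-imbalanced two-Gaussian sample and computing both metrics; all data and parameter choices below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration (not the paper's data): two Gaussian classes
# with a 9:1 class imbalance and equal (identity) covariance.
n0, n1 = 900, 100
X0 = rng.normal(size=(n0, 2))                         # majority class, mean (0, 0)
X1 = rng.normal(size=(n1, 2)) + np.array([2.0, 2.0])  # minority class

# Fisher LDA direction from the pooled covariance.
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (n0 + n1 - 2)
w = np.linalg.solve(S, m1 - m0)
s0, s1 = X0 @ w, X1 @ w

# Plug-in LDA threshold: midpoint of the projected means, shifted by the
# log prior odds so the decision rule accounts for the class imbalance.
t = 0.5 * (m0 + m1) @ w - np.log(n1 / n0)

# Misclassification error rate (ER) and AUC via the rank statistic.
er = (np.sum(s0 > t) + np.sum(s1 <= t)) / (n0 + n1)
a = np.mean(s1[:, None] > s0[None, :])
```

AUC is threshold-free while ER depends on the threshold (and hence on the class priors), which is one way the two metrics can disagree after re-balancing.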
Short note on two output-dependent hidden Markov models
The purpose of this note is to study the assumption of "mutual information independence", which is used by Zhou (2005) to derive an output-dependent hidden Markov model, the so-called discriminative HMM (D-HMM), in the context of determining a stochastic optimal sequence of hidden states. The assumption is extended to derive its generative counterpart, the G-HMM. In addition, state-dependent representations of two output-dependent HMMs, namely HMMSDO (Li, 2005) and the D-HMM, are presented.
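As background for the state-decoding problem the note refers to, here is a minimal sketch of Viterbi decoding for a standard HMM; the transition matrix, emission matrix and initial distribution below are hypothetical toy values, not taken from the papers discussed:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable hidden-state sequence for a standard HMM.

    pi: (S,)   initial state distribution
    A:  (S, S) transition probabilities A[i, j] = P(s_t = j | s_{t-1} = i)
    B:  (S, O) emission probabilities  B[i, o] = P(obs = o | state = i)
    """
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))           # best log-probability ending in each state
    back = np.zeros((T, S), dtype=int)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + logA   # (from-state, to-state)
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + logB[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy two-state example: sticky states, mostly faithful emissions.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
path = viterbi(pi, A, B, [0, 0, 1, 1])
```

The output-dependent models discussed in the note modify the conditional structure of this decoding problem rather than the dynamic-programming recursion itself.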
Learning Mixtures of Gaussians in High Dimensions
Efficiently learning mixtures of Gaussians is a fundamental problem in
statistics and learning theory. Given samples coming from a random one out of k
Gaussian distributions in R^n, the learning problem asks to estimate the means
and the covariance matrices of these Gaussians. This learning problem arises in
many areas ranging from the natural sciences to the social sciences, and has
also found many machine learning applications. Unfortunately, learning mixtures
of Gaussians is an information-theoretically hard problem: in order to learn
the parameters up to a reasonable accuracy, the number of samples required is
exponential in the number of Gaussian components in the worst case. In this
work, we show that provided we are in high enough dimensions, the class of
Gaussian mixtures is learnable in its most general form under a smoothed
analysis framework, where the parameters are randomly perturbed from an
adversarial starting point. In particular, given samples from a mixture of
Gaussians with randomly perturbed parameters, when n > Ω(k^2), we give
an algorithm that learns the parameters in polynomial running time and using
a polynomial number of samples. The central algorithmic ideas consist of new ways
to decompose the moment tensor of the Gaussian mixture by exploiting its
structural properties. The symmetries of this tensor are derived from the
combinatorial structure of higher order moments of Gaussian distributions
(sometimes referred to as Isserlis' theorem or Wick's theorem). We also develop
new tools for bounding smallest singular values of structured random matrices,
which could be useful in other smoothed analysis settings.
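To illustrate the central object here, a minimal sketch (my own illustration, not the paper's algorithm) of forming the empirical third-order moment tensor of a mixture sample; its symmetry under index permutations is exactly the structure the decomposition ideas exploit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample from a hypothetical 2-component Gaussian mixture in R^3.
n, d = 4000, 3
comp = rng.integers(0, 2, size=n)
means = np.array([[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0]])
X = rng.normal(size=(n, d)) + means[comp]

# Empirical third-order moment tensor M3[i, j, k] = E[x_i x_j x_k].
M3 = np.einsum("ni,nj,nk->ijk", X, X, X) / n

# M3 is symmetric under any permutation of its three indices; this symmetry
# (and its higher-order analogues via Isserlis'/Wick's theorem) is what
# moment-tensor decompositions build on.
sym_gap = max(np.abs(M3 - M3.transpose(1, 0, 2)).max(),
              np.abs(M3 - M3.transpose(2, 1, 0)).max())
```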
Microstructure Effects on Daily Return Volatility in Financial Markets
We simulate a series of daily returns from intraday price movements initiated
by microstructure elements. Significant evidence is found that daily returns
and daily return volatility exhibit first order autocorrelation, but trading
volume and daily return volatility are not correlated, while intraday
volatility is. We also consider GARCH effects in daily return series and show
that estimates using daily returns are biased from the influence of the level
of prices. Using daily price changes instead, we find evidence of a significant
GARCH component. These results suggest that microstructure elements have a
considerable influence on the return-generating process.

Comment: 15 pages, as presented at the Complexity Workshop in Aix-en-Provence
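As a small, self-contained illustration of the kind of GARCH component discussed, here is a numpy simulation of a GARCH(1,1) daily return series; the parameter values are hypothetical, chosen only to exhibit volatility clustering:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a GARCH(1,1) return series:
#   sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}
omega, alpha, beta = 0.05, 0.10, 0.85
T = 5000
r = np.zeros(T)
sigma2 = np.full(T, omega / (1 - alpha - beta))  # start at unconditional variance
for t in range(1, T):
    sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    r[t] = np.sqrt(sigma2[t]) * rng.normal()

def lag1_autocorr(x):
    x = x - x.mean()
    return (x[1:] @ x[:-1]) / (x @ x)

# Squared returns (a volatility proxy) are positively autocorrelated,
# while the returns themselves are close to uncorrelated.
ac_r = lag1_autocorr(r)
ac_r2 = lag1_autocorr(r ** 2)
```

The contrast between the two autocorrelations is the signature of a GARCH component in return data.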
D-optimal designs via a cocktail algorithm
A fast new algorithm is proposed for numerical computation of (approximate)
D-optimal designs. This "cocktail algorithm" extends the well-known vertex
direction method (VDM; Fedorov 1972) and the multiplicative algorithm (Silvey,
Titterington and Torsney, 1978), and shares their simplicity and monotonic
convergence properties. Numerical examples show that the cocktail algorithm can
lead to dramatically improved speed, sometimes by orders of magnitude, relative
to either the multiplicative algorithm or the vertex exchange method (a variant
of VDM). Key to the improved speed is a new nearest neighbor exchange strategy,
which acts locally and complements the global effect of the multiplicative
algorithm. Possible extensions to related problems such as nonparametric
maximum likelihood estimation are mentioned.

Comment: A number of changes after accounting for the referees' comments, including new examples in Section 4 and more detailed explanations throughout
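A minimal numpy sketch of the classical multiplicative algorithm that the cocktail algorithm builds on; the quadratic-regression design space below is a hypothetical example, not one from the paper:

```python
import numpy as np

# Design space: quadratic regression f(x) = (1, x, x^2) on a grid in [-1, 1].
xs = np.linspace(-1.0, 1.0, 21)
F = np.column_stack([np.ones_like(xs), xs, xs ** 2])  # candidate design points
n, p = F.shape

# Multiplicative algorithm (Silvey, Titterington and Torsney, 1978):
#   w_i <- w_i * d_i(w) / p,  where d_i = f_i^T M(w)^{-1} f_i.
w = np.full(n, 1.0 / n)
for _ in range(2000):
    M = F.T @ (w[:, None] * F)                          # information matrix M(w)
    d = np.einsum("ij,ij->i", F @ np.linalg.inv(M), F)  # variances d_i(w)
    w = w * d / p

# By the Kiefer-Wolfowitz equivalence theorem, max_i d_i(w) = p at the
# D-optimum; here the optimal design puts weight ~1/3 at x = -1, 0, 1.
max_d = d.max()
```

The update is monotone in log det M(w) but acts globally and can be slow near the optimum, which is the gap the paper's nearest-neighbour exchange step is designed to close.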
A Bayesian reassessment of nearest-neighbour classification
The k-nearest-neighbour procedure is a well-known deterministic method used
in supervised classification. This paper proposes a reassessment of this
approach as a statistical technique derived from a proper probabilistic model;
in particular, we modify the assessment made in a previous analysis of this
method undertaken by Holmes and Adams (2002,2003), and evaluated by Manocha and
Girolami (2007), where the underlying probabilistic model is not completely
well-defined. Once a clear probabilistic basis for the k-nearest-neighbour
procedure is established, we derive computational tools for conducting Bayesian
inference on the parameters of the corresponding model. In particular, we
assess the difficulties inherent to pseudo-likelihood and to path sampling
approximations of an intractable normalising constant, and propose a perfect
sampling strategy to implement a correct MCMC sampler associated with our
model. If perfect sampling is not available, we suggest using a Gibbs sampling
approximation. Illustrations of the performance of the corresponding Bayesian
classifier are provided for several benchmark datasets, demonstrating in
particular the limitations of the pseudo-likelihood approximation in this
set-up.
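For reference, the deterministic procedure being reassessed can be sketched in a few lines; the toy data below is hypothetical:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classic k-nearest-neighbour classification by majority vote."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]        # indices of the k closest points
        votes = np.bincount(y_train[nearest])  # vote counts per class label
        preds.append(int(votes.argmax()))
    return np.array(preds)

# Two well-separated toy clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [-0.2, 0.1],
                    [3.0, 3.0], [3.1, 2.8], [2.8, 3.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([[0.05, 0.05], [2.9, 3.1]]), k=3)
```

The paper's contribution is to replace this deterministic vote with a properly specified probabilistic model on which Bayesian inference can be conducted.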
Research informed sustainable development through art and design pedagogic practices
This paper explores a pedagogic case study, which embeds academic research activity into a master's-level unit of study. Students were invited to work alongside the LiFE 'Living in Future Ecologies' research group at Manchester School of Art to collaboratively investigate themes for sustainable development within a city context. Pomona Island, a brownfield site on the borders of Manchester, Salford and Trafford, presented a context for complex issues of local government, and questions of international relevance on resilience and responsible urban planning. Through learning about the landscape and sensitive ecology of the island, students and researchers explored notions of context, climate, visions for future living, and the opportunities and responsibility of art and design practices in steering social reasoning within a neoliberal system. This paper presents a carefully considered enquiry-based framework, analysing academic questioning that has enabled the transformation of the ephemeral and immaterial into a methodology to address misguided political agendas. The paper articulates the different methods used to embed research practice in the learning environment. This type of project also fully illustrates innovative learning and teaching methods as ways in which art and design practices can uniquely engage with and stimulate thinking to influence and nurture change. Through presenting responses from a psychogeographical walk for Manchester European City of Science in July 2016, a conversational, transformative tool for learning was developed. Reflections on the project further evaluate the multi-disciplinary interpretations, already collated in a collaborative publication with the Pomona community and publisher Gaia Project.
An approximate Bayesian marginal likelihood approach for estimating finite mixtures
Estimation of finite mixture models when the mixing distribution support is
unknown is an important problem. This paper gives a new approach based on a
marginal likelihood for the unknown support. Motivated by a Bayesian Dirichlet
prior model, a computationally efficient stochastic approximation version of
the marginal likelihood is proposed and large-sample theory is presented. By
restricting the support to a finite grid, a simulated annealing method is
employed to maximize the marginal likelihood and estimate the support. Real and
simulated data examples show that this novel stochastic
approximation--simulated annealing procedure compares favorably to existing
methods.

Comment: 16 pages, 1 figure, 3 tables
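A generic sketch of simulated annealing over a finite grid, the optimization device used here; the toy objective below stands in for the marginal likelihood and is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

# Finite grid of candidate points and a toy objective to maximize
# (standing in for the marginal likelihood over the support).
grid = np.linspace(0.0, 1.0, 101)
def objective(x):
    return -(x - 0.3) ** 2  # unimodal, maximized at x = 0.3

# Simulated annealing: propose a neighbouring grid point, always accept
# improvements, accept downhill moves with probability exp(delta / temp).
i = 0                                    # start at the left end of the grid
best_x, best_f = grid[i], objective(grid[i])
temp = 1.0
for step in range(3000):
    j = min(max(i + rng.choice([-1, 1]), 0), len(grid) - 1)
    delta = objective(grid[j]) - objective(grid[i])
    if delta >= 0 or rng.random() < np.exp(delta / temp):
        i = j
    if objective(grid[i]) > best_f:
        best_x, best_f = grid[i], objective(grid[i])
    temp *= 0.995                        # geometric cooling schedule
```

Early high-temperature moves explore the grid, while the cooled late phase behaves like hill-climbing, which is why the method can escape local modes that pure greedy search would get stuck in.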
Quantitative assessment of sewer overflow performance with climate change in northwest England
Changes in rainfall patterns associated with climate change can affect the operation of a combined sewer system, particularly through a potential increase in rainfall amount. This could lead to excessive spill frequencies and could also introduce hazardous substances into the receiving waters, which, in turn, would have an impact on the quality of shellfish and bathing waters. This paper quantifies the spill volume, duration and frequency of 19 combined sewer overflows (CSOs) to receiving waters under two climate change scenarios, the high-emissions (A1FI) and the low-emissions (B1) scenarios, simulated by three global climate models (GCMs), for a study catchment in northwest England. Future rainfall is downscaled from climatic variables of the HadCM3, CSIRO and CGCM2 GCMs using a hybrid generalized linear–artificial neural network model. In the worst case, the high-emissions scenario as projected by HadCM3, the model simulations for 2080 showed an annual increase of 37% in total spill volume, 32% in total spill duration, and 12% in spill frequency relative to the shellfish water limiting requirements. Nevertheless, under all three GCMs the catchment drainage system is projected to cope with the conditions in 2080. The results also indicate that under scenario B1 a significant drop was projected by CSIRO, reaching in the worst case up to 50% in spill volume, 39% in spill duration and 25% in spill frequency. The results further show that, during the bathing season, a substantial drop in the CSO spill drivers is expected, as predicted by all GCMs under both scenarios.